In this project, I apply several machine learning techniques to a binary classification problem using datasets of aflatoxin-contaminated corn kernels. The algorithms I'll explore are Partial Least Squares Discriminant Analysis (PLS-DA), Random Forest (RF), Support Vector Machine (SVM), and Gradient Boosting Machine (GBM).
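Samples are separated into high (HC) and low (LC) aflatoxin contamination classes based on SNI 01-3929-2006, which sets the maximum aflatoxin level for poultry feed in Indonesia at 50 ppb. Samples at or below 50 ppb are labeled LC, and samples above 50 ppb are labeled HC.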
For this analysis, I focus exclusively on the SC212M x PHW79 corn hybrid. This dataset comprises 247 samples, with 107 classified as HC and 140 as LC.
Import the dataset and remove the unused data.
#import the dataset
aflatoxin_data <- read.csv("kernel_data/spectralsignatures.csv", header = TRUE)

#select the "SC212M x PHW79" samples
df <- subset(aflatoxin_data, Hybrid == "SC212M x PHW79")

#create a reference table, and label each sample as high (HC) or low (LC) contamination
AF_ref <- df[, c(1, 3)]
AF_ref$contaminant <- ifelse(AF_ref$AF_level <= 50, "LC", "HC")
AF_ref$contaminant <- as.factor(AF_ref$contaminant)

#remove the second to fourth columns
df <- df[, -c(2:4)]

#remove "stray light", which can introduce unwanted noise:
#drop the first 50 and last 50 wavenumbers (column 1 is kept)
df <- df[, -c(2:51, (ncol(df) - 49):ncol(df))]
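The labeling rule and the column trimming above are easy to get wrong by one index, so here is a minimal sketch that checks both on a small synthetic data frame. Everything in it is an assumption made for illustration: the `toy` frame, its `ID` column, the 120 spectral columns, and the `af_level` values are made up and do not come from the real CSV.

```r
# Toy stand-in for the real data: 1 ID column followed by 120 spectral columns.
# All names and values here are hypothetical, not taken from the actual dataset.
set.seed(1)
toy <- data.frame(ID = 1:6, matrix(runif(6 * 120), nrow = 6))
af_level <- c(0, 12.5, 50, 50.1, 300, 1200)  # made-up aflatoxin levels in ppb

# Same rule as in the analysis: at or below 50 ppb is LC, above 50 ppb is HC
contaminant <- factor(ifelse(af_level <= 50, "LC", "HC"), levels = c("LC", "HC"))
print(table(contaminant))

# Drop the first 50 and last 50 spectral columns, keeping the ID in column 1
n_trim <- 50
keep <- toy[, -c(2:(n_trim + 1), (ncol(toy) - n_trim + 1):ncol(toy))]
print(dim(keep))
```

On this toy frame, a sample at exactly 50 ppb falls in the LC class, and the trimmed frame keeps the ID column plus the 20 middle spectral columns. Running the same `table()` call on `AF_ref$contaminant` is a quick way to confirm that the real data gives the expected 107 HC and 140 LC samples.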